Novel data sources

Georeferenced or georeferenceable data are increasingly available on the web. However, identifying relevant information for geospatial applications and analysis requires sifting through different data types as well as varying levels of organization and accuracy. Goodchild's prediction that citizens are increasingly becoming sensors is now a reality, with growing numbers of volunteered photos, reviews, and tweets that include location information identifying where the content was captured, accessed, or recorded. Such data represent a large, geographically widespread, and likely socially diverse data source.
For example, websites often list business addresses that can be mined for locational analysis. Collecting such data typically requires more intensive, tailored algorithms, unlike social media platforms, which are often accompanied by an Application Programming Interface (API). APIs are specially designed protocols that allow developers to access web content from specific platforms (e.g. Twitter, Facebook). While such interfaces enable scraping of this content, they also impose data limits and, in some cases, restrictions to prevent large-scale data collection that would tax servers. Such limits are in place to protect the anonymity of users as well as to protect data that is increasingly being commodified.

Web scraping

Web scraping is the process of retrieving and parsing content from the web. In its simplest form, web scraping is a copy-and-paste operation: manually identifying a webpage's content and organizing this information into relational databases. In fact, copy-and-paste techniques can be a solution when websites have barriers that prevent machine-automated scraping. Much of the time, however, the underlying structure of websites makes it possible to automate such procedures for more efficient collection of the content.

Web scraping requires deciphering the structure of a website and developing algorithms that parse relevant content. Most websites display content organized according to the structure of a database, making it possible to develop algorithms that systematically query this data and extract it as needed. For example, on a tourism advertising website, the phone number, address, and name will likely be stored according to a systematic hierarchy. By detecting these templates, parsing programs, or wrappers, are able to translate this content into a relational form. Wrappers can be designed to interpret Hypertext Markup Language (HTML) code, web browser schemes (the Document Object Model, or DOM), the Uniform Resource Locator (URL) scheme, and semantic markups (e.g. semantic annotation recognition). Semantic markups simplify the development of wrappers because tags are known beforehand or are easily interpreted. Extensible Markup Language (XML) and JavaScript Object Notation (JSON) are particularly prominent annotations. For example, JSON is used by Google Maps to encode address data, including tags that indicate place names, addresses, and coordinates. Algorithm development is rarely straightforward, as data on the web are often unstructured or loosely structured, containing superfluous information that needs to be filtered out in order to obtain georeferenced or georeferenceable web content.
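
As a simple illustration of what a wrapper does, the sketch below uses the rvest package (not part of this week's exercise) to read a page's HTML tree and pull elements out by their tags; the URL and CSS class names are hypothetical placeholders, not a real site.

library(rvest)

## Read the page's HTML tree (hypothetical listings page)
page <- read_html("https://www.example.com/listings")

## Parse elements by their (assumed) CSS classes; real sites will use different tags
names     <- html_text2(html_elements(page, ".listing-name"))
addresses <- html_text2(html_elements(page, ".listing-address"))
phones    <- html_text2(html_elements(page, ".listing-phone"))

## Organize the parsed content into a relational (tabular) form
listings <- data.frame(name = names, address = addresses, phone = phones)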

Web crawling

Web crawling is the automated and methodical browsing of the web using programs or automated scripts (web crawlers, web spiders, or web robots) and can be used in conjunction with web scraping to obtain content from multiple web pages. Webpages within a site are typically encoded similarly, following common scripts or templates within page hierarchies, which allows for systematic information retrieval (provided these templates are used). Algorithms can be used to detect systematic differences in URLs to retrieve data from multiple pages. In addition to gathering information from a particular database accessible via the web, web crawling may involve identifying links within websites and following these links, and the links from the linked websites, until a desired network is reached.
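
A minimal crawling sketch along these lines, again using rvest and an assumed URL pattern (a page number appended to a query string), loops over pages and collects the links that could be followed next:

library(rvest)

## Assumed URL pattern; real sites will differ
base_url <- "https://www.example.com/listings?page="

all_links <- c()
for (i in 1:5) {
  page  <- read_html(paste0(base_url, i))                # systematic difference in the URL
  links <- html_attr(html_elements(page, "a"), "href")   # links found on this page
  all_links <- c(all_links, links)                       # these could be followed in turn
}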

Processing of web content

High-level programming languages offer a large range of scripts or packages that can detect website templates and parse web content based on text grepping and regular-expression matching. Grepping isolates keywords or patterns, based on the UNIX grep command or regular-expression matching. Carefully applied, this enables the user to capture specific, well-defined sections of a page's HTML tree. Such techniques help process the large amounts of data generated through web crawling or scraping by sifting out irrelevant content and parsing the relevant key terms or number sequences.
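
For instance, a regular-expression filter in R (using stringr, which we load later in the exercise) might keep only the scraped lines that contain an address or phone number; the text below is made up for illustration.

library(stringr)

## Hypothetical scraped lines of text
scraped <- c("Blue Nile Lodge, 123 Main St, Ann Arbor, MI 48104",
             "Contact: (734) 555-0199",
             "Follow us on social media!")

## Keep only lines containing a ZIP code or a US-style phone number
relevant <- scraped[grepl("\\d{5}|\\(\\d{3}\\) \\d{3}-\\d{4}", scraped)]

## Extract the matched patterns themselves (NA where a line has no match)
str_extract(relevant, "\\d{5}$")                      # ZIP code from the address line
str_extract(relevant, "\\(\\d{3}\\) \\d{3}-\\d{4}")   # phone number from the contact line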

Lecture Demonstration

In today’s lab we will be looking under the hood of a website to give you some rudimentary understanding of web content. We will also give some suggestions on how to scrape this web content. Finally, I will demonstrate how to access the Twitter API for scraping Tweets.

  1. To start, I am going to navigate to The Guardian US website and choose a news article that I think is interesting. (Some of the content will obviously be dated in the screenshots.)

  1. Now if we right-click on the page and choose Inspect, we see a window that shows lines of code. What is this exactly? It is basically the backend of the page. It organizes the graphics on the page (e.g. advertisements, pictures) as well as the information (e.g. the article content). This information is stored in a nested hierarchy.

  1. If we hover over the different elements in the main window, these elements will be highlighted, revealing different components of the website.

  1. If we do a little exploring and open the different nested elements by clicking on the arrows, we can find specific elements within the website. For example, this is the first paragraph of the article.

  1. Notice the different HTML tags that are associated with the data. At the top we have <div class="content__head content.... and then we have <p>, which in HTML marks a paragraph. This information lets us know where we might find the article text on The Guardian website. There are several different web scrapers that can do this for you, but in the exercise today we will only be working with APIs. APIs organize tags so they can be quickly accessed by developers, and platforms structure their sites so that others can develop applications based on their content. For example, if we go to the developer page of Twitter we find resources to access Tweets.

  1. To actually publish and analyze Tweets, most platforms require us to register for API access. This helps them monitor who is doing what. To do this, you will need a Twitter account. Sign up or sign in to your existing account to start.

  1. Once you have logged in, navigate back to the developer page, scroll down, and choose a solution - Academic Research or another option as you wish.

  1. Click on Apply for an account.

  1. We now enter the apply-for-access page. Click on the Apply for a developer account button.

  1. Choose the primary reason for using Twitter developer tools. This is up to you, but I used Academic and Student.

  1. Fill in the details.

  1. Now we fill out some details about how we will use the data. Confirm on the next page that everything looks good!

  1. Review and accept the developer agreement. You should receive confirmation shortly.

  1. Once your developer account has been confirmed (this may take a day or two - remember to answer any follow-up questions), we can make an App. Click on the Apps button on the developer website and in the new window press Create an app.

  1. In the App details, fill in a name, description, and website URL, click Enable Sign in with Twitter, and finally describe how the App will be used (the other fields are not required).

  1. After the App is successfully created, you will be able to see its details. Importantly for us, we will need to collect the Consumer API keys and the Access token and Access token secret. These can be found under the Keys and tokens tab. Create the tokens, copy both the keys and tokens, and paste them in a safe location to save for later. So now we are all set to map Tweets… well, actually there is a rather large amount of coding involved if you go by the developer website. Luckily for us, there is an R package that helps us easily query Tweets.

rtweet

rtweet is an R package designed to collect and organize Twitter data via Twitter's REST and stream Application Programming Interfaces (APIs), which are documented at the following URL: https://developer.twitter.com/en/docs. It allows us to listen to the current stream of Tweets as well as query Twitter's database of Tweets.

rtweet documentation can be found here and here.

Exercise

  1. First we will want to install the appropriate R packages. This probably means using the install.packages() function to grab rtweet, tidytext, and stringr. We will also install ggspatial, sf, and rnaturalearth, which we might demonstrate at the end if we have time.
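## install the packages first if needed (uncomment to run; package list taken from the step above)
# install.packages(c("rtweet", "tidytext", "stringr", "ggspatial", "sf", "rnaturalearth"))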
# possibly use an older version if you run into the following issue:
# Error: lexical error: invalid char in json text.
#devtools::install_version("rtweet", version = "0.6.7")
library(rtweet)
# plotting and pipes - tidyverse!
library(ggplot2)
## Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
## when loading 'dplyr'
library(dplyr)
# text mining library
library(tidytext)
library(stringr)
library(maps)
library(tidyverse)
library(ggspatial)
library(sf)
library(rnaturalearth)
  1. We now have to set up an authorization token so that Twitter will allow us to access the data. rtweet does this with a few lines of code. First we give the details of the App that we made on the developer website. This includes the name of the App and the consumer_key, consumer_secret, access_token, and access_secret. These should remain secret, and I have left them blank for this exercise.
## Give the name of the App 
appname <- "Geoscraping"

api_key <- "XXXXXXXXXXX"
api_secret_key <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token_secret <- "XXXXXXXXXXXXXXXXXXXXXXXXXX"
  1. Now we will authenticate our App online with Twitter. This is done by feeding the different keys and secrets into rtweet's create_token() function.
## authenticate via web browser
token <- create_token(
  app = appname,
  consumer_key = api_key,
  consumer_secret = api_secret_key,
  access_token = access_token,
  access_secret = access_token_secret)
  1. After authentication we can search the Twitter database. This returns a collection of relevant Tweets matching a specified query, q. The search is not exhaustive: not all Tweets are indexed or made available via the search interface. The specific query elements included in the rtweet function can be found here. The q argument is the character query that will be searched. For example, we can search q = "climate change" to look for Tweets mentioning the words ‘climate’ and ‘change’ (spaces act as AND). We can also use Boolean operators in the search. For example, q = "climate OR change" expands the search to Tweets that mention climate or change. Wrapping the phrase in double quotes inside single quotes, q = '"climate change"', returns the exact phrase ‘climate change’.

Other important arguments in the function include: n = 10, which specifies the number of Tweets you want returned; type =, which specifies whether you want recent (the default), popular, or mixed Tweets returned; include_rts, if you want to include retweets, e.g. include_rts = TRUE; and geocode, which restricts the search to a specific geographic area. This is important for us, as we want to grab Tweets with their coordinates; we will use geocode below when we map Tweets.
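
To make the query syntax concrete, here is a quick sketch combining these arguments (not run here; it assumes the App has been authenticated as above, and the object names are just examples):

# 'climate' AND 'change'
# climate_and <- search_tweets(q = "climate change", n = 10)
# 'climate' OR 'change', excluding retweets
# climate_or <- search_tweets(q = "climate OR change", n = 10, include_rts = FALSE)
# the exact phrase 'climate change', mixing recent and popular Tweets
# climate_phrase <- search_tweets(q = '"climate change"', n = 10, type = "mixed")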

Let’s try to search for Tweets from SEAS. We will use the hashtag #SEAS and indicate that we only want 5 Tweets.

SEAS_tweets <- search_tweets(q = "#SEAS", n = 5)
## Searching for tweets...
## Finished collecting tweets!
## Here we can see some of the information that is returned from the query
names(SEAS_tweets)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "hashtags"                "symbols"                
## [17] "urls_url"                "urls_t.co"              
## [19] "urls_expanded_url"       "media_url"              
## [21] "media_t.co"              "media_expanded_url"     
## [23] "media_type"              "ext_media_url"          
## [25] "ext_media_t.co"          "ext_media_expanded_url" 
## [27] "ext_media_type"          "mentions_user_id"       
## [29] "mentions_screen_name"    "lang"                   
## [31] "quoted_status_id"        "quoted_text"            
## [33] "quoted_created_at"       "quoted_source"          
## [35] "quoted_favorite_count"   "quoted_retweet_count"   
## [37] "quoted_user_id"          "quoted_screen_name"     
## [39] "quoted_name"             "quoted_followers_count" 
## [41] "quoted_friends_count"    "quoted_statuses_count"  
## [43] "quoted_location"         "quoted_description"     
## [45] "quoted_verified"         "retweet_status_id"      
## [47] "retweet_text"            "retweet_created_at"     
## [49] "retweet_source"          "retweet_favorite_count" 
## [51] "retweet_retweet_count"   "retweet_user_id"        
## [53] "retweet_screen_name"     "retweet_name"           
## [55] "retweet_followers_count" "retweet_friends_count"  
## [57] "retweet_statuses_count"  "retweet_location"       
## [59] "retweet_description"     "retweet_verified"       
## [61] "place_url"               "place_name"             
## [63] "place_full_name"         "place_type"             
## [65] "country"                 "country_code"           
## [67] "geo_coords"              "coords_coords"          
## [69] "bbox_coords"             "status_url"             
## [71] "name"                    "location"               
## [73] "description"             "url"                    
## [75] "protected"               "followers_count"        
## [77] "friends_count"           "listed_count"           
## [79] "statuses_count"          "favourites_count"       
## [81] "account_created_at"      "verified"               
## [83] "profile_url"             "profile_expanded_url"   
## [85] "account_lang"            "profile_banner_url"     
## [87] "profile_background_url"  "profile_image_url"
  1. As you can see, a large amount of data is returned. Some of the fields are particularly interesting, including: the Tweet text, user_id, created_at, screen_name, favorite_count, retweet_count, and geo_coords. Let’s look at this data. (Feel free to explore.)
SEAS_tweets$text
## [1] "Strong #winds, rough #seas alert \n\n#WeatherForTheWeekAhead #WeatherUpdate #WeatherForecast #WeatherReport  #RainyDay #lka #SriLanka @MeteoLK \n\nhttps://t.co/gwwSL0P3Yl"                                                                                                                                      
## [2] "Beware of calm seas..\n#quotes #seas #shallow #deeper https://t.co/8PxRRXfUMT"                                                                                                                                                                                                                                   
## [3] "#SEAS #Bears_of_Wall_Street SeaWorld: Bankruptcy Risk Is Real https://t.co/9QXJaahPfY"                                                                                                                                                                                                                           
## [4] "The best hikes are not always the longest but the one's with the best views! #nature #ocean #seas #lookout #beaches #mountains #naturephotographry #outdoors #hiker #backpacking #daytrip #outdoorfeverdirect https://t.co/kGY3nffV3T"                                                                           
## [5] "First threshold value for #GoodEnvironmentalStatus to be adopted under the EU #MarineDirective. A great example of how sound science, political willingness and public pressure can come together to improve the state of European #seas and #ocean. \n\n#MSFD #EUBeachCleanup #OurOcean https://t.co/ZaswFiVXPK"
SEAS_tweets$user_id
## [1] "3244922072"          "104223961"           "770087685738340353" 
## [4] "1280909896754454528" "1306130924975906816"
SEAS_tweets$created_at
## [1] "2020-09-21 17:51:05 UTC" "2020-09-21 10:04:30 UTC"
## [3] "2020-09-21 09:26:03 UTC" "2020-09-21 07:54:01 UTC"
## [5] "2020-09-21 07:53:11 UTC"
SEAS_tweets$screen_name
## [1] "sumanebot"       "EmperatrizPages" "designyourinves" "outdoorfd"      
## [5] "alice_belin"
SEAS_tweets$favorite_count
## [1] 0 1 0 0 3
SEAS_tweets$retweet_count
## [1] 2 0 0 0 0
SEAS_tweets$geo_coords
## [[1]]
## [1] NA NA
## 
## [[2]]
## [1] NA NA
## 
## [[3]]
## [1] NA NA
## 
## [[4]]
## [1] NA NA
## 
## [[5]]
## [1] NA NA
  1. Here we see the details of these 5 Tweets. This will change daily as new Tweets are added and stored in the database. Not all will have coordinates. In order to ensure coordinates we can use the geocode = argument in the function. We specify the location of the Tweets using coordinates, following the template "latitude,longitude,radius". Here we use the coordinates of Ann Arbor. You can look up coordinates using Google Maps (which uses the same format).
geo_tweets <- search_tweets("lang:en", geocode = "42.28139,83.74833,100mi", n = 5)
## Searching for tweets...
## Finished collecting tweets!
geo_tweets$geo_coords 
## [[1]]
## [1] NA NA
## 
## [[2]]
## [1] NA NA
## 
## [[3]]
## [1] NA NA
## 
## [[4]]
## [1] NA NA
## 
## [[5]]
## [1] NA NA
  1. Unfortunately, few if any of these Tweets have a geolocation. Tweeting coordinates is opt-in, and few users actually reveal their location. Twitter will supply a feed from the area for all users in that area, but only those who have opted to share their coordinates will show up with "geo" info. The coordinates object is only present (non-null) when the Tweet is assigned an exact location; if an exact location is provided, the coordinates object gives a [longitude, latitude] array with the geographical coordinates, and a Twitter Place corresponding to that location is assigned. If you choose to map Tweets, using larger areas or the Twitter stream is often a good option.
geo_tweets <- search_tweets("lang:en", geocode = "42.28,-83.74,100mi", n = 1000)
## Searching for tweets...
## Finished collecting tweets!
  1. Now that we have increased the sample size, we should have some specific coordinates. To map these, we create lat/lng variables from all available Tweet and profile geo-location data.
## create lat/lng variables using all available tweet and profile geo-location data
geo_tweets_coord <- lat_lng(geo_tweets)
nrow(geo_tweets_coord)
## [1] 999
  1. Now we plot them using the maps package.
## plot state boundaries
par(mar = c(0, 0, 0, 0))
maps::map("state", fill = TRUE, col = "#ffffff", 
  lwd = .25, mar = c(0, 0, 0, 0), 
xlim = c(-90, -82), y = c(41, 48))
## plot lat and lng points onto state map
with(geo_tweets_coord, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))

Stream tweets

We can also access the live Twitter stream. This includes a random sample (approximately 1%) of the entire live stream of all Tweets.

  1. Here we just grab all the tweets for 30 seconds and then 10 seconds.
## random sample for 30 seconds (default)
live <- stream_tweets(q="")
## Streaming tweets for 30 seconds...
## Finished streaming tweets!
## opening file input connection.
## closing file input connection.
nrow(live)
nrow(stream_tweets("", timeout = 10))
## Streaming tweets for 10 seconds...
## Finished streaming tweets!
## opening file input connection.
## closing file input connection.
  1. As you can see, you can control the length of time that you ‘listen’ to the stream. We can also define a location to listen to. In this case, we are grabbing Tweets from New Delhi. Strangely, there are Tweets that fall outside of New Delhi. I have never understood why that is (I don’t use Twitter all that much and have not taken the time to figure it out). I suspect this is a geotag set by the tweeter that may not correspond to the defined location. Bonus if you can tell me why.
New_Delhi <- stream_tweets(q="",  geocode = "28.61,77.21,100km", timeout = 60)
## Streaming tweets for 60 seconds...
## Finished streaming tweets!
## opening file input connection.
## closing file input connection.
## Grab the coordinates
New_Delhi <- lat_lng(New_Delhi)
par(mar = c(0, 0, 0, 0))
maps::map("world", lwd = .25)
## plot lat and lng points onto the world map
with(New_Delhi, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))

  1. Finally, we can convert the data into an sf object, which makes it easier to map in different packages (e.g. ggplot2, cartography). We will need to use the sf package, which doesn’t like NAs, so first we filter these out. We use !is.na() on the place_type tag, as the place object is always present when a Tweet is geo-tagged, while the coordinates object is only present (non-null) when the Tweet is assigned an exact location.
## here we are filtering out NA
Delhi_coord <- filter(New_Delhi, !is.na(place_type))
## Now we change the data into an sf object. We define the coordinate columns 
## (longitude, latitude) and provide an appropriate projection (WGS84, EPSG:4326)
Delhi_coord <- st_as_sf(Delhi_coord, coords = c("lng", "lat"), crs = 4326)
## Using ggplot2 and the rnaturalearth package, which provides vector data 
## for countries and the globe, we can map the points
world <- ne_countries(scale = "medium", returnclass = "sf")

ggplot(data = world) +
    geom_sf() +
    ## here we specify the size, shape, and fill color of the points to be mapped
    geom_sf(data = Delhi_coord, size = 4, shape = 23, fill = "darkred")

In lab this week, we will be learning how to scrape Flickr photographs and how to use keywords in the captions to identify specific photographs.